
Introduction to Semi-Supervised Learning

Semi-supervised learning is a machine learning paradigm that falls between supervised and unsupervised learning. It trains on a small amount of labeled data combined with a large amount of unlabeled data, using the unlabeled data to improve on what the labels alone can teach.

The Need for Semi-Supervised Learning

In many real-world scenarios, obtaining labeled data is:

  • Expensive: Requires human experts to annotate
  • Time-consuming: Manual labeling can take significant time
  • Sometimes impractical: Some domains impose inherent constraints, such as privacy restrictions or a shortage of qualified annotators

Meanwhile, unlabeled data is typically:

  • Abundant: Can be collected automatically
  • Inexpensive: No human annotation required
  • Informative: Reveals the underlying data distribution

Semi-supervised learning bridges this gap by leveraging both types of data.

Core Assumptions

Semi-supervised learning relies on specific assumptions about the relationship between data distribution and the target function:

1. Smoothness Assumption

  • Points that are close to each other in a high-density region are likely to share a label
  • Consequently, the decision boundary should pass through low-density regions

2. Cluster Assumption

  • Data points tend to form distinct clusters
  • Points in the same cluster are likely to have the same label

3. Manifold Assumption

  • High-dimensional data often lies on or near a low-dimensional manifold
  • Learning the manifold structure from unlabeled data helps classification

Types of Semi-Supervised Learning

Inductive Semi-Supervised Learning

  • Goal: Learn a function that can predict labels for unseen data
  • Uses labeled and unlabeled data during training
  • Once trained, can make predictions without unlabeled data

Transductive Semi-Supervised Learning

  • Goal: Predict labels for specific unlabeled examples used during training
  • No generalization to new, unseen data points
  • Example: Graph-based methods that propagate labels directly

Common Approaches

Self-Training (Pseudo-Labeling)

  • Train a model on labeled data
  • Use the model to predict labels for unlabeled data
  • Add high-confidence predictions to the labeled dataset
  • Retrain the model on the expanded labeled set and repeat (see the sketch below)
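
A minimal sketch of this loop, assuming scikit-learn and NumPy are available; the 0.9 confidence threshold and round limit are illustrative choices, and scikit-learn's SelfTrainingClassifier packages the same idea as a ready-made estimator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Iteratively pseudo-label unlabeled points the model is confident about."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        probs = model.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold   # high-confidence points only
        if not confident.any():
            break                                    # nothing left to add; stop early
        pseudo = model.classes_[probs[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)  # retrain
    return model
```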

Co-Training

  • Train multiple models on different views/features of the data
  • Each model labels unlabeled data for the other models
  • Requires naturally occurring distinct views or an artificial feature split (see the sketch below)
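
A compact sketch of one co-training round, assuming the two views come from splitting the feature columns at an index `split`; the choice of base classifier and the number of points labeled per round (`k`) are illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train_round(X_lab, y_lab, X_unlab, split, k=5):
    """One co-training round: each view's model labels its k most confident
    unlabeled points, which then go into the shared labeled pool."""
    views = [slice(None, split), slice(split, None)]   # two column ranges
    models = [GaussianNB().fit(X_lab[:, v], y_lab) for v in views]
    for i, v in enumerate(views):
        if len(X_unlab) == 0:
            break
        probs = models[i].predict_proba(X_unlab[:, v])
        top = np.argsort(probs.max(axis=1))[-k:]       # k most confident points
        pseudo = models[i].classes_[probs[top].argmax(axis=1)]
        # Move the newly labeled points into the labeled pool ...
        X_lab = np.vstack([X_lab, X_unlab[top]])
        y_lab = np.concatenate([y_lab, pseudo])
        # ... and out of the unlabeled pool.
        X_unlab = np.delete(X_unlab, top, axis=0)
        # Retrain the *other* model on the grown labeled set.
        j = 1 - i
        models[j] = GaussianNB().fit(X_lab[:, views[j]], y_lab)
    return models, X_lab, y_lab, X_unlab
```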

Generative Models

  • Model the joint distribution of data and labels
  • Use labeled data to learn conditional distributions
  • Use unlabeled data to better estimate the overall data distribution (see the sketch below)
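
One simple instantiation, offered only as an illustration: fit a Gaussian mixture with one component per class, initialize the components from the labeled class means, let EM on all data (labeled plus unlabeled) refine the density estimate, then map components back to classes via the labeled points:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_generative_ssl(X_lab, y_lab, X_unlab, n_classes):
    """Gaussian mixture with one component per class; assumes labels 0..n_classes-1."""
    # Initialize each component at its class mean, estimated from labeled data.
    means_init = np.vstack([X_lab[y_lab == c].mean(axis=0)
                            for c in range(n_classes)])
    gmm = GaussianMixture(n_components=n_classes, means_init=means_init,
                          random_state=0)
    # EM on ALL points: the unlabeled data sharpens the density estimate.
    gmm.fit(np.vstack([X_lab, X_unlab]))
    # Components can drift during EM, so re-anchor each one to a class by
    # majority vote of the labeled points it claims.
    comp = gmm.predict(X_lab)
    comp_to_class = np.array([
        np.bincount(y_lab[comp == k], minlength=n_classes).argmax()
        if (comp == k).any() else k
        for k in range(n_classes)])
    return gmm, comp_to_class

# Prediction for new points: comp_to_class[gmm.predict(X_new)]
```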

Graph-Based Methods

  • Construct a graph where nodes are data points
  • Connect similar instances with weighted edges
  • Propagate labels from labeled to unlabeled nodes along the graph structure (see the sketch below)
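
scikit-learn implements this family directly; the sketch below runs LabelSpreading on a toy two-moons dataset, with -1 marking unlabeled points per sklearn's convention (the dataset and the roughly 5% label fraction are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
y_train = np.copy(y)
# Hide most labels: -1 marks a point as unlabeled for sklearn.
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y)) > 0.05   # keep roughly 5% of labels
y_train[unlabeled] = -1

model = LabelSpreading(kernel="rbf", gamma=20)
model.fit(X, y_train)
print("accuracy on originally unlabeled points:",
      (model.transduction_[unlabeled] == y[unlabeled]).mean())
```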

Semi-Supervised Support Vector Machines (S3VM)

  • Extend traditional SVMs to include unlabeled data
  • Find a decision boundary that separates the labeled data while passing through low-density regions of the unlabeled data (a simplified sketch follows)
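
Exact S3VM optimization is non-convex and usually handled by specialized solvers; the sketch below is only a crude self-labeling approximation in that spirit, alternating between guessing labels for the unlabeled points and refitting with those points down-weighted (the weight and iteration count are arbitrary choices):

```python
import numpy as np
from sklearn.svm import SVC

def s3vm_like(X_lab, y_lab, X_unlab, n_iter=10, unlab_weight=0.5):
    """Crude S3VM-style heuristic, not the exact S3VM objective."""
    clf = SVC(kernel="linear").fit(X_lab, y_lab)
    for _ in range(n_iter):
        y_unlab = clf.predict(X_unlab)        # current guess for unlabeled labels
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_unlab])
        # Down-weight the unlabeled points so guesses count less than labels.
        w = np.concatenate([np.ones(len(y_lab)),
                            np.full(len(y_unlab), unlab_weight)])
        clf = SVC(kernel="linear").fit(X_all, y_all, sample_weight=w)
    return clf
```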

Performance Considerations

When Semi-Supervised Learning Works Well

  • When assumptions hold true for the data
  • When labeled data is scarce but high quality
  • When unlabeled data provides useful structure information

When It Can Fail

  • When assumptions are violated
  • When labeled data is too scarce to bootstrap learning
  • When incorrect pseudo-labels are reinforced during training, causing errors to propagate (confirmation bias)

Applications

  • Text Classification: Using small sets of labeled documents with large unlabeled corpora
  • Image Recognition: Leveraging abundant unlabeled images with few labeled examples
  • Medical Diagnosis: Using limited diagnosed cases with many undiagnosed medical records
  • Speech Recognition: Combining transcribed and untranscribed audio samples
  • Protein Structure Prediction: Using known structures to help predict unknown ones
  • Web Content Classification: Categorizing web pages with limited manual annotations

Evaluation

Evaluating semi-supervised learning methods requires careful consideration:

  • Hold out labeled data for testing
  • Compare against supervised learning trained on the labeled data alone
  • Compare against two-step unsupervised-then-supervised pipelines
  • Measure performance as a function of the labeled/unlabeled ratio (see the sketch below)
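
A sketch of the last point: sweep the fraction of retained labels and compare a purely supervised baseline against a semi-supervised model on a held-out test set (the two-moons data, model choices, and label fractions are all illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=1000, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.RandomState(0)
for frac in [0.02, 0.05, 0.10, 0.20]:
    labeled = rng.rand(len(y_tr)) < frac        # keep this fraction of labels
    # Supervised baseline: labeled points only.
    sup = LogisticRegression().fit(X_tr[labeled], y_tr[labeled])
    # Semi-supervised: all points, with -1 marking the unlabeled ones.
    y_semi = np.where(labeled, y_tr, -1)
    ssl = LabelSpreading(gamma=20).fit(X_tr, y_semi)
    print(f"{frac:.0%} labeled: supervised={sup.score(X_te, y_te):.3f} "
          f"semi-supervised={ssl.score(X_te, y_te):.3f}")
```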

Recent Advances

  • MixMatch: Unifies consistency regularization, entropy minimization (via label sharpening), and MixUp augmentation
  • FixMatch: Combines pseudo-labeling with weak/strong augmentation consistency and a fixed confidence threshold (its unlabeled loss is sketched below)
  • UDA (Unsupervised Data Augmentation): Uses strong data augmentation for consistency regularization
  • Mean Teacher: Maintains an exponential moving average of the model's weights as a teacher that provides consistency targets
  • Virtual Adversarial Training: Adds adversarial perturbations to inputs and enforces prediction consistency
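
As an illustration of the consistency-plus-thresholding recipe, here is a minimal PyTorch sketch of FixMatch's unlabeled loss term (the 0.95 threshold matches the paper's default; the model, augmentations, and the rest of the training loop are omitted):

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_weak, logits_strong, threshold=0.95):
    # Pseudo-label from the weakly augmented view; no gradient flows through it.
    with torch.no_grad():
        probs = F.softmax(logits_weak, dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()   # keep only confident predictions
    # Cross-entropy on the strongly augmented view, masked by confidence.
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```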

By effectively leveraging both labeled and unlabeled data, semi-supervised learning offers a powerful approach for many real-world problems where labeled data is limited but unlabeled data is plentiful.